Autumn Wilcox – Contributed to analytic coding, content draft, structured the project workflow, and collaborated actively via GitHub.
Ecil Teodoro – Drafted and submitted sections of the Introduction.
Namita Mishra – Developed the project plan, content draft, analytic coding, and coordinated commits and collaboration with the group on GitHub.
Advisor: Dr. Ashraf Cohen – Provided expert guidance on statistical methods and workflow management across platforms.
Methods Summary
Data Source: NHANES adults dataset
Handling Missing Data: Multiple Imputation by Chained Equations (MICE)
Survey Weighting: Accounted for population representativeness

Modeling Approaches:
- Survey-weighted logistic regression (MLE)
- Multiple imputation logistic model
- Bayesian logistic regression using NUTS sampling (4 chains, 2000 iterations)

Model Diagnostics: R̂ ≈ 1.00; high ESS → excellent convergence
Posterior Predictive Checks: Compared model predictions to observed prevalence
Introduction
Diabetes mellitus (DM) is a major public health concern closely associated with factors such as obesity, age, race, and gender. Identifying these associated risk factors is essential for targeted interventions D’Angelo and Ran (2025). Traditional logistic regression, which estimates the association between risk factors and outcomes, is insufficient for analyzing complex healthcare data (DNA sequences, imaging, patient-reported outcomes, electronic health records (EHRs), longitudinal health measurements, diagnoses, and treatments) Zeger et al. (2020). Classical maximum likelihood estimation (MLE) yields unstable results in samples that are small, contain missing data, or present quasi- or complete separation.
Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) allow analysis of multivariate longitudinal healthcare data with repeated measures within individuals and individuals nested in a population. By integrating prior knowledge and including exogenous (e.g., age, clinical history) and endogenous (e.g., current treatment) covariates, Bayesian models provide posterior distributions and risk predictions for conditions such as pneumonia, prostate cancer, and mental disorders. Parametric assumptions remain a limitation of these models.
Chatzimichail and Hatjimihail (2023) have shown that parametric Bayesian models (with fixed parametric forms) often underperform compared to nonparametric models, which do not assume a particular distributional form. Posterior probabilities from Bayesian approaches improve disease classification and better capture heterogeneity in skewed, bimodal, or multimodal data distributions. Bayesian nonparametric models are flexible and robust, integrating multiple diagnostic tests and priors to enhance accuracy and precision, though reliance on prior information and restricted access to resources can limit applicability. Combining Bayesian methods with other statistical or computational approaches helps address systemic biases, incomplete data, and non-representative datasets.
The Bayesian framework described by R. van de Schoot et al. (2021) highlights the role of priors, data modeling, inference, posterior sampling, variational inference, and variable selection. Proper variable selection mitigates multicollinearity, overfitting, and limited sampling, improving predictive performance. Priors can be informative, weakly informative, or diffuse, and can be elicited from expert opinion, generic knowledge, or data-driven methods. Sensitivity analysis evaluates the alignment of priors with likelihoods, while MCMC simulations (e.g., brms, blavaan in R) empirically estimate posterior distributions. Spatial and temporal Bayesian models have applications in large-scale cancer genomics, identifying molecular interactions, mutational signatures, patient stratification, and cancer evolution, though temporal autocorrelation and subjective prior elicitation can be limiting.
Bayesian normal linear regression has been applied in metrology for instrument calibration using conjugate Normal–Inverse-Gamma priors Klauenberg et al. (2015). Hierarchical priors add flexibility by modeling uncertainty across multiple levels, improving robustness and interpretability. Bayesian hierarchical/meta-analytic linear regression incorporates both exchangeable and unexchangeable prior information, addressing multiple testing challenges, small sample sizes, and complex relationships among regression parameters across studies Leeuw and Klugkist (2012).
Sequential clinical reasoning models Liu et al. (2013) demonstrate screening by adding predictors stepwise: (1) demographics, (2) metabolic components, and (3) conventional risk factors, incorporating priors and mimicking clinical evaluation. This approach captures ecological heterogeneity and improves baseline risk estimation, though interactions between predictors and external cross-validation remain limitations.
Bayesian multiple imputation with logistic regression addresses missing data in clinical research Austin et al. (2021) by classifying missing values (e.g., patient refusal, loss to follow-up, mechanical errors) as MAR, MNAR, or MCAR. Multiple imputation generates plausible values across datasets and pools the results for reliable classification of patient health status and mortality.
Aims
The study aims to apply Bayesian logistic regression to predict diabetes status and to evaluate the associations between body mass index (BMI), age (≥20 years), gender, and race as predictors, using a retrospective dataset from the 2013–2014 NHANES survey. NHANES employs a complex sampling design, including stratification, clustering, and oversampling of specific population subgroups, rather than uniform random sampling. A Bayesian analytical approach is used to address challenges posed by dataset anomalies such as missing data, complete case analysis, and separation that limit the efficiency and reliability of traditional logistic regression in predicting health outcomes.
Method and Data Preparation
In this study, Bayesian logistic regression is applied to predict diabetes status and estimate the association of key predictors, including body mass index (BMI), age, gender, and race, using the 2013–2014 NHANES dataset. The complex survey design of NHANES, including stratification, clustering, and oversampling, was accounted for to ensure accurate estimation. Posterior predictive distributions are generated to quantify the probability of diabetes for each individual, providing a flexible and robust framework for uncertainty quantification. Model predictions are also compared against known population prevalence, enabling validation and assessment of model realism. By integrating prior information with observed data, this approach identifies individuals at high risk for diabetes and informs targeted public health interventions, demonstrating the utility of Bayesian methods in analyzing complex, multivariate health data.
Statistical Tool: We use R and its packages and libraries to import data and perform data wrangling and analysis.
Data source: NHANES is a cross-sectional, weighted, 2-year dataset (2013–2014) Center for Health Statistics (1999). Three datasets (demographics, exam, questionnaire) were imported (the haven package was used to convert the .XPT files into R dataframes), and after selecting the variables of interest a merged dataset was created.
Data pre-processing and cleaning: Subsets were created from the original weighted datasets (demographics, exam, questionnaire) with the selected variables and merged using the participant ID to create a single dataframe.
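The import and merge steps described above can be sketched as follows. This is a sketch, not the original code: the 2013–2014 file names (DEMO_H.XPT, BMX_H.XPT, DIQ_H.XPT) and the use of SEQN as the merge ID are assumptions based on standard NHANES conventions, though all variable names below appear in the analysis code later in this report.

```r
# Sketch of the import/merge step (file names are assumptions for the
# 2013–2014 "H" cycle; SEQN is the NHANES respondent ID).
library(haven)
library(dplyr)

demo  <- read_xpt("DEMO_H.XPT")   # demographics
exam  <- read_xpt("BMX_H.XPT")    # body measures (exam)
quest <- read_xpt("DIQ_H.XPT")    # diabetes questionnaire

merged_data <- demo %>%
  dplyr::select(SEQN, RIDAGEYR, RIAGENDR, RIDRETH1,
                SDMVPSU, SDMVSTRA, WTMEC2YR) %>%
  inner_join(dplyr::select(exam, SEQN, BMXBMI), by = "SEQN") %>%
  inner_join(dplyr::select(quest, SEQN, DIQ010, DIQ050), by = "SEQN")
```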
Response Variable: Binary; type 2 / diagnosed diabetes (excluding gestational diabetes), from "Doctor told you have diabetes?" (DIQ010), combined with DIQ050 (a secondary variable describing treatment status, insulin use) to exclude gestational cases.

Predictor Variables: Body Mass Index (factor, 4 levels, analyzed after standardization):
- Underweight (<5th percentile)
- Normal (5th–<85th percentile)
- Overweight (85th–<95th percentile)
- Obese (≥95th percentile)
- Missing: kept as-is, since the categorization provides clinically interpretable groups
Covariates:
Gender (factor, 2 levels): Male, Female

Ethnicity (factor, 5 levels): Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Other Race (including multi-racial)
Age (number, continuous)
The merged dataset was cleaned; exploratory data analysis results and visualizations are presented for 10,175 observations and 10 variables, with race categorized in 5 levels, age ranging from 0 to 80 years, gender (male and female), diabetes derived from DIQ010 and DIQ050, and BMI treated as continuous.
Code
# weighted means of each variable
str(merged_data)
cat("Count with DIQ010 in {1,2}:", sum(merged_data$DIQ010 %in% c(1, 2), na.rm = TRUE), "\n")
Count with DIQ010 in {1,2}: 9578
Code
# ---- Save to file for reuse ----
dir.create("data", showWarnings = FALSE, recursive = TRUE)
saveRDS(merged_data, "data/merged_2013_2014.rds")
message("Saved: data/merged_2013_2014.rds")
Exploratory data analysis

- Used library(survey) to obtain weighted means and standard deviations of the variables. BMI and age were standardized.
- Age was recoded into derived variables; only adults (>20 years) were included in the analysis.
- BMI was recoded into categories: <18.5, 18.5–<25, 25–<30, 30–<35, 35–<40, ≥40.
- Ethnicity was recoded as "Mexican American" = "1", "Other Hispanic" = "2", "NH White" = "3", "NH Black" = "4", "Other/Multi" = "5".
- Since special codes are not random, they cannot simply be dropped; ignoring informative missingness (MAR or MNAR) could introduce bias. We recoded special codes (3, 7, 9) to NA and retained all NAs in the analysis. Visualization of missing data is presented below.
- A final analytic dataset ('adult') was created with "NH White" and "Male" as the reference groups.
Code
# ---------------- Basic Exploration (adults) ----------------
# Keep adults only and build analysis variables
adult <- merged_data %>%
  dplyr::filter(RIDAGEYR >= 20) %>%
  dplyr::transmute(
    # --- keep survey design variables so svydesign() can see them ---
    SDMVPSU, SDMVSTRA, WTMEC2YR,
    # --- outcome: DIQ010 (1 yes, 2 no; 3/7/9 -> NA) ---
    diabetes_dx = dplyr::case_when(
      DIQ010 == 1 ~ 1,
      DIQ010 == 2 ~ 0,
      DIQ010 %in% c(3, 7, 9) ~ NA_real_,
      TRUE ~ NA_real_
    ),
    # --- predictors (raw) ---
    bmi = BMXBMI,
    age = RIDAGEYR,
    # sex (1=Male, 2=Female)
    sex = forcats::fct_recode(factor(RIAGENDR), Male = "1", Female = "2"),
    # race (5-level)
    race = forcats::fct_recode(
      factor(RIDRETH1),
      "Mexican American" = "1",
      "Other Hispanic"   = "2",
      "NH White"         = "3",
      "NH Black"         = "4",
      "Other/Multi"      = "5"
    ),
    # keep DIQ050 so we can safely reference it (may be absent/NA in some rows)
    DIQ050 = DIQ050
  ) %>%
  # standardize continuous predictors
  dplyr::mutate(
    age_c = as.numeric(scale(age)),
    bmi_c = as.numeric(scale(bmi)),
    bmi_cat = cut(
      bmi,
      breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
      labels = c("<18.5", "18.5–<25", "25–<30", "30–<35", "35–<40", "≥40"),
      right = FALSE
    )
  ) %>%
  # adjust outcome: if female & DIQ050==1 ("only when pregnant"), set to 0 (not diabetes)
  dplyr::mutate(
    diabetes_dx = ifelse(sex == "Female" & !is.na(DIQ050) & DIQ050 == 1, 0, diabetes_dx)
  )

# Make NH White the reference level for race (clearer interpretation)
adult <- adult %>%
  dplyr::mutate(race = forcats::fct_relevel(race, "NH White"))

# --- sanity checks ---
cat("Adults n =", nrow(adult), "\n")
NHANES is a national survey based on a complex sampling design, oversampling certain groups (e.g., minorities, older adults) to ensure representation. It uses multistage sampling to represent the U.S. population, so we apply sampling weights, strata, and PSUs (primary sampling units) for valid estimates.

We incorporate the survey design into the regression analysis to avoid:
- biased prevalence estimates (e.g., mean BMI or diabetes %)
- underestimation of standard errors
- incorrect inference for population-level parameters
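The design object used in the modeling chunks below (nhanes_design_adult) is referenced but its construction is not shown in this excerpt; a minimal sketch, assuming the 'adult' dataframe built above and standard NHANES design variables, would look like this:

```r
# Sketch: build the survey design object (nhanes_design_adult is assumed
# from later code; this construction follows standard NHANES practice).
library(survey)

nhanes_design_adult <- svydesign(
  ids     = ~SDMVPSU,    # primary sampling units
  strata  = ~SDMVSTRA,   # stratification variable
  weights = ~WTMEC2YR,   # MEC exam weights (2-year cycle)
  nest    = TRUE,        # PSUs are nested within strata
  data    = adult
)

# Weighted prevalence of diagnosed diabetes among adults
svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)
```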
Code
# data exploration
print(table(adult$diabetes_dx, useNA = "ifany"))
0 1 <NA>
4974 618 177
Code
print(table(adult$sex, useNA = "ifany"))
Male Female
2758 3011
Code
print(table(adult$race, useNA = "ifany"))
NH White Mexican American Other Hispanic NH Black
2472 767 508 1177
Other/Multi
845
Code
if (sum(!is.na(adult$diabetes_dx)) == 0) {
  stop("Too few non-missing outcomes for modeling (n = 0). Check DIQ010 upstream.")
}
# (optional plots omitted for brevity)

# save for downstream
if (!dir.exists("data")) dir.create("data", recursive = TRUE)
saveRDS(adult, "data/adult_cleaned_2013_2014.rds")
Code
ggplot(adult, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "white") +
  labs(title = "Distribution of Age >20 years",
       x = "Age (years)", y = "Count") +
  theme_minimal()
Code
ggplot(adult, aes(factor(diabetes_dx))) +
  geom_bar(fill = "steelblue") +
  labs(title = "Diabetes Outcome Distribution in >20 years age group",
       x = "diabetes_dx (0=No, 1=Yes)", y = "Count")
Code
ggplot(adult, aes(factor(bmi_cat))) +
  geom_bar(fill = "steelblue") +
  labs(title = "Diabetes Outcome Distribution by BMI in >20 years age group",
       x = "bmi_cat")
Code
ggplot(adult, aes(x = factor(diabetes_dx), y = bmi)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "BMI Distribution by Diabetes Diagnosis in >20 years age group",
       x = "Diabetes Diagnosis (0 = No, 1 = Yes)", y = "BMI") +
  theme_minimal()
Code
# plots for adult data: bmi categories and race categories
ggplot(adult, aes(x = factor(race), fill = factor(diabetes_dx))) +
  geom_bar(position = "dodge") +
  labs(title = "Diabetes Diagnosis by Race in >20 years age group",
       x = "Race/Ethnicity", y = "Count",
       fill = "Diabetes Diagnosis\n(0 = No, 1 = Yes)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Code
ggplot(adult, aes(x = factor(bmi_cat), fill = factor(diabetes_dx))) +
  geom_bar(position = "dodge") +
  labs(title = "Diabetes Diagnosis by BMI in >20 years age group",
       x = "BMI", y = "Count",
       fill = "Diabetes Diagnosis\n(0 = No, 1 = Yes)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Statistical Methods and Modeling
Multiple Logistic Regression on the Survey-Weighted Dataset

- We conducted a frequentist multiple logistic regression on the survey-weighted dataset, for complete-case analysis and data exploration.
Multivariate Imputation by Chained Equations (MICE)
We conducted MICE to manage missing data. Considering the small sample size, Multivariate Imputation by Chained Equations (MICE) was conducted as an alternative to the Bayesian approach Buuren and Groothuis-Oudshoorn (2011).

Multiple imputation (MI) is an alternative analytic approach for small datasets with missingness. Flatness of the density, heavy tails, non-zero peakedness, skewness, and multimodality do not hamper the good performance of multiple imputation for the mean structure in samples of n > 400, even for high percentages (75%) of missing data in one variable.

Multiple Imputation (MI) can be performed using the mice package in R. Iterative MICE imputes the missing values of one variable at a time, using regression models based on the other variables in the dataset. In the chained process, each imputed variable becomes a predictor for the subsequent imputation, and the entire process is repeated multiple times to create several complete datasets, each reflecting a different plausible set of values for the missing data.
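The MICE workflow described above can be sketched as follows. This is a minimal sketch, assuming the 'adult' analytic dataset from Data Prep; the imputed dataset adult_imp1 referenced later in this report could plausibly be produced this way, but the exact call is not shown in the original code.

```r
# Minimal MICE sketch (assumes the 'adult' dataset; m and maxit are
# illustrative choices, not the authors' settings).
library(mice)
library(dplyr)

imp_vars <- adult %>%
  dplyr::select(diabetes_dx, age_c, bmi_c, sex, race)

# m = 5 imputed datasets; imputation methods are chosen automatically
# by variable type (pmm for numeric, logreg/polyreg for factors)
imp <- mice(imp_vars, m = 5, maxit = 10, seed = 123, printFlag = FALSE)

# Fit the logistic model on each completed dataset, then pool
# the estimates across datasets with Rubin's rules
fits <- with(imp, glm(diabetes_dx ~ age_c + bmi_c + sex + race,
                      family = binomial()))
summary(pool(fits))

# A single completed dataset (e.g., as input to the Bayesian model)
adult_imp1 <- complete(imp, 1)
```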
Bayesian Logistic Regression
We conduct Bayesian logistic regression to estimate the associations of BMI, age, sex, and race/ethnicity with doctor-diagnosed diabetes (DIQ010) and to predict diabetes status.
Bayesian statistics is about updating beliefs with evidence: Posterior ∝ Likelihood × Prior.
- Prior p(θ): the initial belief about a parameter before seeing the data.
- Likelihood p(y|θ): how probable the observed data are given the parameters; derived from the model (e.g., the logistic regression likelihood).
- Posterior p(θ|y): the updated belief about the parameter after seeing the data.
- We selected Bayesian logistic regression because our response variable is a binary outcome (diabetes: yes/no).
- Bayesian logistic regression is based on the binomial likelihood and Bayes' rule, and predicts the probability of the disease outcome.
- The model relates the predictors (age, race, BMI, gender) linearly to the log-odds of the response variable (diabetes).
- Observations are assumed to be independent given the predictors.
- Regression of a binary variable (0 or 1) uses a Bernoulli probability model that classifies the categorical response, here predicting diabetes.
- The logit link converts the linear predictor into probabilities for the response variable.
- We use weakly informative priors: Normal(0, 2.5) for the logistic regression coefficients and Normal(0, 10) for the intercept, which allows a wide range of baseline log-odds, helps convergence, and avoids extreme estimates; a good default for most applications in social, health, or epidemiological studies.
- In Bayesian statistics, every unknown parameter (e.g., a regression coefficient, mean, or variance) is treated as a random variable with a probability distribution that reflects uncertainty.
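As a toy illustration of the updating rule Posterior ∝ Likelihood × Prior (not part of the analysis pipeline), a grid approximation for a single log-odds parameter with a hypothetical 12 cases out of 100 people:

```r
# Toy grid approximation of Bayes' rule on the log-odds scale.
# Hypothetical data: 12 diabetes cases out of 100 people.
theta <- seq(-5, 5, length.out = 1001)     # grid over log-odds
prior <- dnorm(theta, mean = 0, sd = 2.5)  # weakly informative prior
p     <- plogis(theta)                     # logit link: log-odds -> probability
lik   <- dbinom(12, size = 100, prob = p)  # binomial likelihood

post <- prior * lik                        # unnormalized posterior
post <- post / sum(post * diff(theta)[1])  # normalize to a density

# Posterior mean on the probability scale (close to 12/100,
# pulled slightly toward the prior)
post_mean_p <- sum(plogis(theta) * post) * diff(theta)[1]
```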
Summaries of the Bayesian regression include:
- Posterior predictive probabilities
- Posterior mean, median, and credible intervals
- Posterior probability (Outcome = 1)
- Comparison with external (population) prevalence
- Posterior model fit metrics
- Prior versus posterior coefficient distributions
- Posterior predictive checks
- Uncertainty quantification
Diagnostics include: trace plots for Markov Chain Monte Carlo convergence, autocorrelation plots, potential scale reduction factor (R̂) assessment, posterior predictive checks, residual analysis, Bayes factor comparison, prior sensitivity analysis, model fit assessment, posterior predictive probability plots, posterior interval coverage evaluation, and convergence diagnostics across chains.
Code
# Modeling
library(broom)
library(mice)
library(brms)
library(posterior)
library(bayesplot)
library(knitr)

# --- Guardrails for modeling ---
n_outcome <- sum(!is.na(adult$diabetes_dx))
if (n_outcome == 0) stop("Too few non-missing outcomes for modeling. n = 0")

# Ensure factors and >= 2 observed levels among complete outcomes
adult <- adult %>%
  dplyr::mutate(
    sex  = if (!is.factor(sex)) factor(sex) else sex,
    race = if (!is.factor(race)) factor(race) else race
  )
if (nlevels(droplevels(adult$sex[!is.na(adult$diabetes_dx)])) < 2)
  stop("sex has <2 observed levels after filtering; check data availability.")
if (nlevels(droplevels(adult$race[!is.na(adult$diabetes_dx)])) < 2)
  stop("race has <2 observed levels after filtering; check Data Prep.")

# ------------------------- 1) Survey-weighted complete-case -------------------------
# Build a logical filter on the original adult data (same length as design$data)
keep_cc <- with(
  adult,
  !is.na(diabetes_dx) & !is.na(age_c) & !is.na(bmi_c) & !is.na(sex) & !is.na(race)
)

# Subset the survey design using the logical vector (same length as original)
des_cc <- subset(nhanes_design_adult, keep_cc)

# Corresponding complete-case data (optional)
cc <- adult[keep_cc, ] |> droplevels()
cat("\nComplete-case N for survey-weighted model:", nrow(cc), "\n")
Complete-case N for survey-weighted model: 5349
Code
print(table(cc$race))
NH White Mexican American Other Hispanic NH Black
2293 713 470 1101
Other/Multi
772
Code
print(table(cc$diabetes_dx))
0 1
4752 597
Code
print(table(cc$sex))
Male Female
2551 2798
Code
form_cc <- diabetes_dx ~ age_c + bmi_c + sex + race
svy_fit <- survey::svyglm(formula = form_cc, design = des_cc, family = quasibinomial())
summary(svy_fit)
Call:
svyglm(formula = form_cc, design = des_cc, family = quasibinomial())
Survey design:
subset(nhanes_design_adult, keep_cc)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.67143 0.11935 -22.383 1.68e-08 ***
age_c 1.10833 0.05042 21.981 1.94e-08 ***
bmi_c 0.63412 0.05713 11.099 3.88e-06 ***
sexFemale -0.63844 0.10926 -5.843 0.000386 ***
raceMexican American 0.71091 0.13681 5.196 0.000826 ***
raceOther Hispanic 0.46469 0.13474 3.449 0.008712 **
raceNH Black 0.51221 0.15754 3.251 0.011677 *
raceOther/Multi 0.84460 0.17756 4.757 0.001433 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 0.8455444)
Number of Fisher Scoring iterations: 6
Prior

- We used a student_t(3, 0, 10) prior for the intercept R. V. D. Schoot et al. (2013).
- student_t(ν, μ, σ) is Student's t-distribution with ν = 3 degrees of freedom (controlling tail heaviness), location μ = 0 (the center, like a mean), and scale σ = 10 (the spread, like a standard deviation).
- student_t(3, 0, 10) is centered at 0, allows large values (± several times 10), and has heavy tails (since ν = 3), accommodating outliers or unexpectedly large parameter values.
- Heavy-tailed priors help the model remain stable, especially with small datasets, high correlation between predictors, or potential outliers in the data.
priors <- c(
  set_prior("normal(0, 2.5)", class = "b"),
  set_prior("student_t(3, 0, 10)", class = "Intercept")
)

bayes_fit <- brm(
  formula = diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race,
  data = adult_imp1,
  family = bernoulli(link = "logit"),
  prior = priors,
  chains = 4, iter = 2000, seed = 123,
  control = list(adapt_delta = 0.95),
  refresh = 0  # quiet Stan output
)
Code
summary(bayes_fit) # Bayesian model summary
Family: bernoulli
Links: mu = logit
Formula: diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race
Data: adult_imp1 (Number of observations: 5592)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -2.66 0.09 -2.84 -2.50 1.00 4187 3510
age_c 1.09 0.06 0.97 1.22 1.00 3012 3098
bmi_c 0.63 0.05 0.53 0.72 1.00 3472 3315
sexFemale -0.66 0.10 -0.86 -0.46 1.00 4003 3052
raceMexicanAmerican 0.69 0.18 0.35 1.04 1.00 3526 2843
raceOtherHispanic 0.43 0.24 -0.07 0.89 1.00 4058 3114
raceNHBlack 0.54 0.15 0.24 0.82 1.00 3597 3177
raceOtherDMulti 0.82 0.19 0.45 1.19 1.00 3763 3257
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Code
prior_summary(bayes_fit)
prior class coef group resp dpar nlpar lb ub
normal(0, 2.5) b
normal(0, 2.5) b age_c
normal(0, 2.5) b bmi_c
normal(0, 2.5) b raceMexicanAmerican
normal(0, 2.5) b raceNHBlack
normal(0, 2.5) b raceOtherDMulti
normal(0, 2.5) b raceOtherHispanic
normal(0, 2.5) b sexFemale
student_t(3, 0, 10) Intercept
tag source
user
(vectorized)
(vectorized)
(vectorized)
(vectorized)
(vectorized)
(vectorized)
(vectorized)
user
Code
library(ggplot2)

# priors for two coefficients (age and bmi), prior = N(0, 2.5)
prior_draws <- tibble(
  term = rep(c("Age (per 1 SD)", "BMI (per 1 SD)"), each = 4000),
  value = c(rnorm(4000, 0, 2.5), rnorm(4000, 0, 2.5))
)

ggplot(prior_draws, aes(x = value, fill = term)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(title = "Prior Distributions for Coefficients",
       x = "Coefficient Value", y = "Density") +
  scale_fill_manual(values = c("skyblue", "orange"))
Code
# Diabetes vs BMI
library(ggplot2)

# Create the plot
ggplot(adult_imp1, aes(x = factor(diabetes_dx), y = bmi, fill = factor(diabetes_dx))) +
  geom_boxplot(alpha = 0.7) +
  scale_x_discrete(labels = c("0" = "No Diabetes", "1" = "Diabetes")) +
  labs(x = "Diabetes Diagnosis", y = "BMI",
       title = "BMI Distribution by Diabetes Status") +
  theme_minimal() +
  theme(legend.position = "none")
Code
# logistic regression curve
ggplot(adult_imp1, aes(x = bmi, y = diabetes_dx)) +
  geom_point(alpha = 0.2, position = position_jitter(height = 0.02)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"),
              se = TRUE, color = "blue") +
  labs(x = "BMI", y = "Probability of Diabetes",
       title = "Predicted Probability of Diabetes vs BMI") +
  theme_minimal()
Once we have posterior draws, we examine:
- Summary statistics (mean, median, 95% credible intervals): summary(bayes_fit) or posterior_summary(bayes_fit)
- Visualization of distribution shape: mcmc_hist(posterior, pars = c("b_age")) or mcmc_areas()
- Pairwise plots (correlations between parameters): mcmc_pairs(posterior)
- Posterior predictive checks (model predictions vs observed data): pp_check(bayes_fit)
- Model comparison using LOO or WAIC: loo(bayes_fit) or waic(bayes_fit)
Assumptions for Bayesian logistic regression are assessed with posterior predictive checks, plots for linearity, MCMC trace plots for convergence, and bayes_R2 for model fit.
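These diagnostics can be produced with a few calls; a minimal sketch, assuming bayes_fit is the brms fit from above:

```r
# Sketch of the diagnostics listed above (assumes bayes_fit from brm()).
library(brms)
library(posterior)
library(bayesplot)

draws <- as_draws_df(bayes_fit)

# Trace plots: chains should mix and overlap (convergence)
mcmc_trace(draws, pars = c("b_age_c", "b_bmi_c"))

# Autocorrelation: should decay quickly across lags
mcmc_acf(draws, pars = c("b_age_c", "b_bmi_c"))

# Rhat / ESS table (at convergence, Rhat ≈ 1)
summary(bayes_fit)$fixed[, c("Rhat", "Bulk_ESS", "Tail_ESS")]

# Posterior predictive check against the observed outcome distribution
pp_check(bayes_fit, ndraws = 100)

# Bayesian R^2 for model fit
bayes_R2(bayes_fit)
```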
Estimated R² = 0.13 (13.1%): the predictors are relevant, but most of the variability remains unexplained; other factors (e.g., genetics, lifestyle, environment) likely contribute. Odds ratios from the posterior predicted values and the Bayesian regression summary are reported below.
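The odds ratios can be obtained by exponentiating the posterior summaries of the coefficients; a minimal sketch, assuming bayes_fit is the brms fit from earlier (the exact call is not shown in the original code):

```r
# Sketch: odds ratios with 95% credible intervals (assumes bayes_fit).
library(brms)

# fixef() returns posterior summaries on the log-odds scale;
# robust = TRUE uses posterior medians. Exponentiate to get odds ratios.
or_tab <- exp(fixef(bayes_fit, robust = TRUE)[, c("Estimate", "Q2.5", "Q97.5")])
round(or_tab, 2)
```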
Code

# Alternative PP plots (histogram / barplot) for binary outcome
# (bar chart preferred for a discrete outcome)
ppc_bars(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:50, ]) +
  labs(title = "Posterior Predictive Check: Barplot of Counts") +
  theme_minimal()
Code
# PP check for proportions (useful for binary): mean comparison
# check whether the simulated means match the observed mean
ppc_stat(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ], stat = "mean") +
  labs(title = "Posterior Predictive Check: Mean of Replicates") +
  theme_minimal()
Code
# sd
ppc_stat(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ], stat = "sd") +
  labs(title = "PPC: Standard Deviation of Replicates") +
  theme_minimal()
Code
# PP checks with bayesplot options
color_scheme_set("blue")
ppc_scatter_avg(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ]) +
  labs(title = "Observed vs Predicted (Avg) Posterior Predictive")
Code

library(posterior)
library(bayesplot)

# Extract posterior draws as a draws_df (simulated posterior outcomes)
post <- as_draws_df(bayes_fit)

# Check parameter names
colnames(post)

Code

# Density overlay for the coefficients
mcmc_areas(post, pars = c(
  "b_age_c", "b_bmi_c", "b_sexFemale",
  "b_raceMexicanAmerican", "b_raceOtherHispanic",
  "b_raceNHBlack", "b_raceOtherDMulti"
))

Code

# Combine observed and predicted into one data frame
# ('predicted' is assumed to come from an upstream fitted(bayes_fit) call, not shown)
plot_data <- adult_imp1 %>%
  mutate(
    predicted_bmi = predicted[, "Estimate"],
    lower_ci = predicted[, "Q2.5"],
    upper_ci = predicted[, "Q97.5"],
    obs_index = 1:nrow(adult_imp1)  # index for x-axis
  )

# Line plot
ggplot(plot_data, aes(x = obs_index)) +
  geom_line(aes(y = bmi, color = "Observed")) +            # observed BMI
  geom_line(aes(y = predicted_bmi, color = "Predicted")) + # predicted BMI
  geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci), alpha = 0.2) +  # uncertainty
  labs(x = "Observation", y = "BMI", color = "Legend") +
  theme_minimal()
Code
summary(adult_imp1$bmi)
Min. 1st Qu. Median Mean 3rd Qu. Max.
14.1 24.1 27.7 29.0 32.4 82.9
Code
summary(plot_data$bmi_c)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.09476 -0.69669 -0.19338 -0.01133 0.46371 7.52398
term estimate type
Length:8000 Min. :-3.585220 Length:8000
Class :character 1st Qu.:-0.697054 Class :character
Mode :character Median : 0.000046 Mode :character
Mean :-0.007350
3rd Qu.: 0.668561
Max. : 3.922817
Code

library(brms)
library(tidyr)
library(dplyr)

# Extract posterior draws (bayes_fit = your brms model)
post <- as_draws_df(bayes_fit) %>%
  dplyr::select(b_bmi_c, b_age_c) %>%   # select the coefficient columns
  pivot_longer(
    everything(),
    names_to = "term",
    values_to = "estimate"
  ) %>%
  mutate(
    term = case_when(
      term == "b_bmi_c" ~ "BMI (per 1 SD)",
      term == "b_age_c" ~ "Age (per 1 SD)"
    ),
    type = "Posterior"
  )

# visualization of prior and posterior draws
# (rename prior_draws' 'value' column and label it so the columns align)
combined_draws <- bind_rows(
  prior_draws %>% rename(estimate = value) %>% mutate(type = "Prior"),
  post
)

library(ggplot2)
ggplot(combined_draws, aes(x = estimate, fill = type)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ term, scales = "free", ncol = 2) +
  theme_minimal(base_size = 13) +
  labs(title = "Prior vs Posterior Distributions",
       x = "Coefficient estimate", y = "Density", fill = "")
Code
# Compute proportion of diabetes = 1 for each draw
pp_proportion <- rowMeans(pp_samples)  # proportion of 1's in each posterior draw

# Summary of posterior proportions
summary(pp_proportion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.09067 0.10533 0.10944 0.10935 0.11320 0.12715
Code
# Optional: visualize the posterior probability distribution
pp_proportion_df <- tibble(proportion = pp_proportion)

ggplot(pp_proportion_df, aes(x = proportion)) +
  geom_histogram(binwidth = 0.01, fill = "skyblue", color = "black") +
  labs(title = "Posterior Distribution of Proportion of Diabetes = 1",
       x = "Proportion of Diabetes = 1", y = "Frequency") +
  theme_minimal()
Code
library(tidyverse)

# Posterior predicted proportion vector
# pp_proportion <- rowMeans(pp_samples)  # if not already done
known_prev <- 0.089  # NHANES prevalence

# Posterior summary
posterior_mean <- mean(pp_proportion)
posterior_ci <- quantile(pp_proportion, c(0.025, 0.975))  # 95% credible interval

# Create a data frame for plotting
pp_df <- tibble(proportion = pp_proportion)

# Plot
ggplot(pp_df, aes(x = proportion)) +
  geom_histogram(binwidth = 0.005, fill = "skyblue", color = "black") +
  geom_vline(xintercept = known_prev, color = "red", linetype = "dashed", size = 1) +
  geom_vline(xintercept = posterior_mean, color = "blue", linetype = "solid", size = 1) +
  geom_rect(aes(xmin = posterior_ci[1], xmax = posterior_ci[2], ymin = 0, ymax = Inf),
            fill = "blue", alpha = 0.1, inherit.aes = FALSE) +
  labs(
    title = "Posterior Predicted Diabetes Proportion vs NHANES Prevalence",
    subtitle = paste0("Red dashed = NHANES prevalence (", known_prev,
                      "), Blue solid = Posterior mean (", round(posterior_mean, 3), ")"),
    x = "Proportion of Diabetes = 1",
    y = "Frequency"
  ) +
  theme_minimal()
Code
library(dplyr)

# Posterior predicted proportion
posterior_mean <- mean(pp_proportion)
posterior_ci <- quantile(pp_proportion, c(0.025, 0.975))  # 95% credible interval

# NHANES prevalence with SE from survey::svymean
# Suppose you already have:
# svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)
known_prev <- 0.089   # Mean prevalence
known_se <- 0.0048    # Standard error from survey

# Calculate 95% confidence interval
known_ci <- c(
  known_prev - 1.96 * known_se,
  known_prev + 1.96 * known_se
)

# Print results
data.frame(
  Type = c("Posterior Prediction", "NHANES Prevalence"),
  Mean = c(posterior_mean, known_prev),
  Lower_95 = c(posterior_ci[1], known_ci[1]),
  Upper_95 = c(posterior_ci[2], known_ci[2])
)
                  Type      Mean   Lower_95  Upper_95
1 Posterior Prediction 0.1093519 0.09781831 0.1212446
2    NHANES Prevalence 0.0890000 0.07959200 0.0984080
Code
library(ggplot2)
library(dplyr)

# Create a data frame for plotting
ci_df <- data.frame(
  Type = c("Posterior Prediction", "NHANES Prevalence"),
  Mean = c(0.1096674, 0.089),
  Lower_95 = c(0.09772443, 0.079592),
  Upper_95 = c(0.1210658, 0.098408)
)

# Plot
ggplot(ci_df, aes(x = Type, y = Mean, color = Type)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = Lower_95, ymax = Upper_95), width = 0.2) +
  ylim(0, max(ci_df$Upper_95) + 0.02) +
  labs(
    title = "Comparison of Posterior Predicted Diabetes Proportion vs NHANES Prevalence",
    y = "Proportion of Diabetes",
    x = ""
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")
Code
# Save the dataset as CSV
write.csv(adult_imp1, "adult_imp1.csv", row.names = FALSE)
Results
Survey-Weighted Logistic Regression
Model:
`svyglm(formula = form_cc, design = des_cc, family = quasibinomial())`
All predictors are significant: age (p < 0.001) and BMI (p < 0.001) show strong positive associations with the outcome, while being female (p = 0.0004) is negatively associated. Other significant associations include raceMexican American (p = 0.0008), raceOther Hispanic (p = 0.0087), raceNH Black (p = 0.0117), and raceOther/Multi (p = 0.0014).
MICE
All predictors are statistically significant. Positive associations: age (p < 0.001), BMI (p < 0.001), and all race categories relative to the reference. Negative association: female sex (p < 0.001).
Bayesian Logistic Regression

Rhat = 1.00 for all parameters → excellent convergence.

Bulk_ESS / Tail_ESS: large values (> 2000) → high effective sample sizes and reliable posterior estimates.
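The R̂ and ESS values above can be pulled programmatically from the fitted model. A minimal sketch, assuming the fitted brms object is named `bayes_fit` (the name used later in this report):

```r
library(posterior)  # assumption: `bayes_fit` is the fitted brms model

# Convergence and mixing diagnostics for every parameter
diag_tbl <- summarise_draws(as_draws_df(bayes_fit),
                            "rhat", "ess_bulk", "ess_tail")

# Flag anything problematic: Rhat above 1.01 or small bulk/tail ESS
subset(diag_tbl, rhat > 1.01 | ess_bulk < 400 | ess_tail < 400)
# An empty result is consistent with "Rhat = 1.00, ESS > 2000" above
```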
Interpretation
Strong predictors: Age and BMI are strongly positively associated with diabetes risk.
Sex effect: Females have a lower probability of diabetes compared to males.
R² = 0.13: the predictors (age, BMI, sex, race) explain about 13% of the variance in the outcome (diabetes_dx). The 95% credible interval (0.106–0.156) indicates that, given the model and data, the true proportion of variance explained is plausibly between 10.6% and 15.6%, reflecting uncertainty in the explained variance that is natural for probabilistic models.
Est.Error = 0.0127 reflects the posterior standard deviation of the R² estimate across posterior samples; this small value indicates that the R² estimate is fairly precise.
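The R² summary quoted above corresponds to brms's Bayesian R². A one-line sketch, assuming the fitted model object is `bayes_fit`:

```r
library(brms)  # assumption: `bayes_fit` is the fitted brms model

# Posterior summary of Bayesian R^2: Estimate, Est.Error, and 95% CI,
# the same quantities as the 0.13 (0.106-0.156) figures reported above
bayes_R2(bayes_fit)
```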
Race/ethnicity: Mexican American, NH Black, and “Other/Diverse” groups have higher odds of diabetes. Other Hispanic group has a less certain effect.
age_c: each 1-unit increase in centered age increases the log-odds of diabetes by 1.09, a strong positive effect.

bmi_c: higher BMI is associated with higher diabetes risk.

sexFemale: females have lower odds of diabetes compared to males.

raceMexicanAmerican: higher odds of diabetes vs. the reference race (likely NH White).

raceOtherHispanic: slightly higher odds vs. the reference, but the interval crosses zero → uncertain effect.

raceNHBlack: significantly higher odds of diabetes compared to the reference.

raceOther/Multi: higher odds of diabetes vs. the reference group.
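Because these coefficients are on the log-odds scale, exponentiating them gives odds ratios directly comparable to the effect sizes quoted later. A minimal base-R sketch: the age_c value (1.09) is from the model summary above, while the bmi_c and sexFemale values are illustrative placeholders, not the fitted estimates:

```r
# Posterior-mean log-odds coefficients; age_c = 1.09 comes from the
# model summary above, the other two are illustrative placeholders
log_odds <- c(age_c = 1.09, bmi_c = 0.58, sexFemale = -0.45)

# Odds ratio per 1-unit increase in each predictor
odds_ratios <- exp(log_odds)
round(odds_ratios, 2)
# age_c -> 2.97: the odds of diabetes nearly triple per unit of age_c
```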
Posterior distributions of all parameters in the model: (1) density plots smooth the posterior samples for each parameter (e.g., intercept, slope) into curves showing where most of the posterior probability mass lies, conveying both the best estimates and their uncertainty.
Posterior Predictive Distribution - generated from posterior predictive draws y_rep ~ p(y_new | θ), which simulate new data given the posterior parameter estimates. Posterior predictive checks (PPC) compare these simulations to the real data to assess model fit.
Incorporating Uncertainty

Two sources of uncertainty are combined: parameter uncertainty, captured in the posterior distributions, and predictive uncertainty, captured in the posterior predictive draws.

Combining the two provides credible intervals for predictions, not just point predictions, supporting statements such as: “Given this BMI, the probability of diabetes is likely between 40% and 55%.”
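The kind of interval statement quoted above comes directly from quantiles of posterior draws of the predicted probability. A self-contained sketch in which simulated draws stand in for the model's output (the 0.47 center and 0.15 logit-scale spread are made-up values chosen to reproduce a roughly 40–55% interval):

```r
set.seed(1)

# Stand-in for posterior draws of P(diabetes) at one covariate profile:
# draws on the logit scale, mapped back to probabilities
p_draws <- plogis(rnorm(4000, mean = qlogis(0.47), sd = 0.15))

# 95% credible interval for the predicted probability
quantile(p_draws, c(0.025, 0.975))
```

In the report itself, the same quantile call is applied to `posterior_linpred(..., transform = TRUE)` output rather than simulated draws.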
Comparing Models
All three models (survey-weighted MLE, multiple imputation, Bayesian) agree closely on the direction and magnitude of the effects of BMI and age.
Age is a stronger predictor than BMI, nearly tripling the odds per 1 SD.
BMI significantly increases diabetes risk (~1.7–1.9× per 1 SD).
Differences between models are minor, indicating findings that are robust to the choice of missing-data handling and modeling approach.
Predicted diabetes proportion vs. population prevalence
The posterior predicted proportion is slightly higher than the observed NHANES prevalence (10.97% vs 8.9%).
Intervals overlap slightly, but the posterior tends to overestimate diabetes compared to NHANES.
The difference could be due to:
Model assumptions (logistic link, priors)
Predictor effects (BMI, age, sex, race)
Sample characteristics vs population weighting in NHANES
These results suggest the model is reasonable but errs toward over-prediction, estimating slightly higher risk than the observed population prevalence.
Conclusion
Across multiple modeling approaches—survey-weighted maximum likelihood, multiple imputation, and Bayesian regression—both age and BMI were consistently strong predictors of diabetes. Each standard deviation increase in age nearly tripled the odds of diabetes, while a similar increase in BMI elevated the odds by approximately 1.7–1.9 times. The consistency of these results across models highlights the robustness of the associations and underscores the importance of age and BMI as key risk factors for diabetes in this population.
Effect stability: point estimates from the Bayesian model closely aligned with those from the frequentist models, indicating that prior regularization stabilized the estimates in the presence of modest missingness.
Uncertainty quantification: Bayesian credible intervals for the odds ratios were slightly narrower than, yet overlapped, the frequentist confidence intervals, suggesting comparable inferential precision while offering improved interpretability.
Design considerations:

Survey-weighted MLE (maximum likelihood estimation) weights each observation according to its survey weight, providing estimates that reflect population-level parameters rather than just the sample, i.e., population-representative estimates.

The Bayesian model with normalized weights, instead of fully modeling the survey design, used normalized sampling weights as importance weights. Scaling the weights to sum to the sample size approximates the effect of survey weights but does not fully account for stratification, clustering, or design-based variance adjustments; Bayesian inference treats the weighted likelihood as if it came from a regular model, ignoring some survey design features.
Autocorrelation interpretation: autocorrelation measures how correlated each MCMC sample is with earlier samples at a given lag (the number of steps between samples); lag 0 is always 1 (perfect correlation with itself). Plotting each chain (1–4) separately for b_age_c (age coefficient) and b_bmi_c (BMI coefficient) shows autocorrelation dropping to near zero after lag 1–2 for both coefficients in all chains, suggesting good mixing: successive samples are mostly independent after a short lag. The absence of persistent high autocorrelation indicates the MCMC chains are converging well; low autocorrelation, as seen here, is desirable because the posterior estimates are then reliable and not biased by chain dependence.
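The rapid decay described above can be illustrated with base R's `acf()` on a simulated chain with mild lag-1 dependence (an AR(1) with coefficient 0.3, a made-up stand-in for a well-mixed MCMC chain):

```r
set.seed(42)

# Simulated chain with mild lag-1 dependence, mimicking a well-mixed sampler
chain <- as.numeric(arima.sim(list(ar = 0.3), n = 2000))

# Autocorrelation at lags 0..5; lag 0 is 1 by definition,
# then values drop quickly toward zero
round(drop(acf(chain, lag.max = 5, plot = FALSE)$acf), 2)
```

For the actual fitted model, something like `bayesplot::mcmc_acf(as.array(bayes_fit), pars = c("b_age_c", "b_bmi_c"))` would produce the per-chain plot described above (assuming the fitted object is named `bayes_fit`).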
Discussion
The use of multiple imputation allowed robust analysis despite missing data, increasing power and reducing bias. Comparison of frequentist and Bayesian models demonstrated consistency in the significant predictors, while the Bayesian approach provided posterior distributions and probabilistic interpretation. Across all models, both age and BMI emerged as strong and consistent predictors of diabetes, and this consistency across modeling approaches strengthens the validity of the findings. Multiple imputation accounted for potential biases due to missing data, and Bayesian modeling produced credible intervals that closely matched the frequentist estimates. These results align with previous epidemiological research indicating that increasing age and higher BMI are among the most important determinants of type 2 diabetes risk, reflecting cumulative exposure to metabolic and lifestyle risk factors over time and the role of excess adiposity and insulin-related effects. The survey-weighted dataset ensured population representativeness, multiple imputation handled missing data, and rigorous Bayesian estimation yielded high effective sample sizes; R̂ ≈ 1.00 across parameters confirmed excellent model convergence. Bayesian logistic regression thus provided statistically consistent and interpretable inference, achieving the aim of this study. Future research could fit hierarchical models across NHANES cycles or add variables (e.g., laboratory tests) to assess nonlinear effects of metabolic risk factors.
Limitations
The study is based on cross-sectional, observational NHANES data, which limits the ability to make causal inferences. Associations observed between BMI, age, and diabetes status cannot confirm causation.
The project relies on multiple imputation for missing values; although imputation reduces bias, it assumes the data are missing at random (MAR). If data are missing not at random (MNAR), results may be biased.
Potential Residual Confounding - Models included key predictors (age, BMI, sex, race), but unmeasured factors like diet, physical activity, socioeconomic status, or genetic predisposition could confound associations.
Bayesian models depend on prior choices, which could influence posterior estimates if priors are informative or mis-specified.
Variable Measurement - BMI is measured at a single time point, which may not reflect long-term exposure or risk.
Self-reported variables are subject to recall or reporting bias.
Interactions (e.g., BMI × age) were not tested in the study, so potential synergistic effects might be missed.
Predicted probabilities are model-based estimates, not validated for clinical decision-making. External validation in independent cohorts is needed before application.
Question for Targeted Therapy
Translational Perspective from the Bayesian Diabetes Prediction Project. This project further demonstrates the translational potential of Bayesian modeling in clinical decision-making and public health strategy. By using patient-level predictors such as age, BMI, sex, and race to estimate the probability of diabetes, the model moves beyond descriptive statistics toward individualized risk prediction. The translational move lies in converting these probabilistic outputs into actionable thresholds—such as identifying the BMI or age at which the predicted risk of diabetes exceeds a clinically meaningful level (e.g., 30%). Such insights can guide early screening, personalized lifestyle interventions, and targeted prevention programs for populations at higher risk. This approach embodies precision public health—bridging data science and medical decision-making to deliver tailored, evidence-based strategies that can ultimately improve diabetes prevention and management outcomes.
What changes in modifiable predictors would lower diabetes risk?
Translational Research Implications:
We now use the model to guide prevention or intervention.
Only BMI is a modifiable risk factor here. What must change in BMI, behavior, or lifestyle to achieve a lower risk threshold? In practice, we hold non-modifiable predictors (sex, race) constant and vary the modifiable predictor (BMI) until the model predicts the desired probability.
Code
# Use the first participant (using multiple covariates to select someone)
participant1_data <- adult[1, ]

# Predicted probabilities for participant 1
# 'transform = TRUE' gives probabilities for logistic regression
phat1 <- posterior_linpred(bayes_fit, newdata = participant1_data, transform = TRUE)

# Store in a data frame for plotting
post_pred_df <- data.frame(pred = phat1)

# Compute 95% credible interval
ci_95_participant1 <- quantile(phat1, c(0.025, 0.975))

# Plot
ggplot(post_pred_df, aes(x = pred)) +
  geom_density(color = 'darkblue', fill = 'lightblue') +
  geom_vline(xintercept = ci_95_participant1[1], color = 'red', linetype = 'dashed') +
  geom_vline(xintercept = ci_95_participant1[2], color = 'red', linetype = 'dashed') +
  xlab('Probability of being diabetic (Outcome = 1)') +
  ggtitle('Posterior Predictive Distribution 95% Credible Interval') +
  theme_bw()
Code
# Use the second participant
participant2_data <- adult[2, ]

# Predicted probabilities for participant 2
# 'transform = TRUE' gives probabilities for logistic regression
phat2 <- posterior_linpred(bayes_fit, newdata = participant2_data, transform = TRUE)

# Store in a data frame for plotting
post_pred_df2 <- data.frame(pred = phat2)

# Compute 95% credible interval
ci_95_participant2 <- quantile(phat2, c(0.025, 0.975))

# Plot
ggplot(post_pred_df2, aes(x = pred)) +
  geom_density(color = 'darkblue', fill = 'lightblue') +
  geom_vline(xintercept = ci_95_participant2[1], color = 'red', linetype = 'dashed') +
  geom_vline(xintercept = ci_95_participant2[2], color = 'red', linetype = 'dashed') +
  xlab('Probability of being diabetic (Outcome = 1)') +
  ggtitle('Posterior Predictive Distribution 95% Credible Interval') +
  theme_bw()
Code
library(ggplot2)

# NOTE: age_c and bmi_c are centered covariates, so supplying raw values
# (40 years, BMI 25) means 40 years and 25 BMI units ABOVE the sample mean,
# which drives the predicted probability to 1 (see the output below)
new_participant <- data.frame(
  age_c = 40,
  bmi_c = 25,
  sex = "Female",
  race = "Mexican American"
)

# Posterior predicted probabilities
phat_new <- posterior_linpred(bayes_fit, newdata = new_participant, transform = TRUE)

# Convert to numeric vector
phat_vec <- as.numeric(phat_new)

# Check the range to see if all values are similar
range(phat_vec)
[1] 1 1
Code
# Store in a data frame
post_pred_df_new <- data.frame(pred = phat_vec)

# Compute 95% credible interval from the vector
ci_95_new_participant <- quantile(phat_vec, c(0.025, 0.975))

# Plot
ggplot(post_pred_df_new, aes(x = pred)) +
  geom_density(color = 'darkblue', fill = 'lightblue', alpha = 0.6) +
  geom_vline(xintercept = ci_95_new_participant[1], color = 'red', linetype = 'dashed') +
  geom_vline(xintercept = ci_95_new_participant[2], color = 'red', linetype = 'dashed') +
  xlim(0, 1) +  # ensures the curve is visible even if values are close together
  xlab('Probability of being diabetic (Outcome = 1)') +
  ggtitle('Posterior Predictive Distribution (95% Credible Interval)') +
  theme_bw()
Code
# Grid of possible BMI values (centered if the model used bmi_c)
# NOTE: as above, the model expects centered covariates; passing raw values
# (age 40, BMI 18-40) here inflates every prediction to ~1 (see output below),
# so the target probability of 0.3 is never actually crossed on this grid
bmi_seq <- seq(18, 40, by = 0.5)

newdata_grid <- data.frame(
  age_c = 40,
  bmi_c = bmi_seq,
  sex = "Female",
  race = "Mexican American"
)

# Posterior predicted probabilities
pred_probs <- posterior_linpred(bayes_fit, newdata = newdata_grid, transform = TRUE)

# Average over posterior draws to get the mean predicted probability per BMI value
prob_mean <- colMeans(pred_probs)

# Combine with BMI values
pred_df <- cbind(newdata_grid, prob_mean)

target_prob <- 0.3
closest <- pred_df[which.min(abs(pred_df$prob_mean - target_prob)), ]
closest
  age_c bmi_c    sex             race prob_mean
1    40    18 Female Mexican American         1
Code
ggplot(pred_df, aes(x = bmi_c, y = prob_mean)) +
  geom_line(color = "darkblue", linewidth = 1.2) +
  geom_hline(yintercept = target_prob, color = "red", linetype = "dashed") +
  geom_vline(xintercept = closest$bmi_c, color = "red", linetype = "dotted") +
  annotate("text", x = closest$bmi_c, y = target_prob + 0.05,
           label = paste0("Target BMI ≈ ", round(closest$bmi_c, 1)),
           color = "red", hjust = -0.1) +
  labs(
    x = "BMI (centered or raw, depending on model)",
    y = "Predicted Probability of Diabetes",
    title = "Inverse Prediction: BMI Needed for Target Diabetes Risk"
  ) +
  theme_bw()
Implications
Age and BMI are robust and independent predictors of diabetes, underscoring the importance of early, targeted interventions to mitigate diabetes risk.
Longitudinal studies, and combining Bayesian modeling with other statistical methods, could further enhance and better inform precision prevention strategies.
References
Austin, Peter C., Ian R. White, Douglas S. Lee, and Stef van Buuren. 2021. “Missing Data in Clinical Research: A Tutorial on Multiple Imputation.” Canadian Journal of Cardiology 37 (9): 1322–31. https://doi.org/10.1016/j.cjca.2020.11.010.
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
National Center for Health Statistics. 1999. “Vital and Health Statistics Reports Series 2, Number 161, September 2013.” National Health and Nutrition Examination Survey: Analytic Guidelines, 1999–2010. https://wwwn.cdc.gov/nchs/data/series/sr02_161.pdf.
Chatzimichail, Theodora, and Aristides T. Hatjimihail. 2023. “A Bayesian Inference Based Computational Tool for Parametric and Nonparametric Medical Diagnosis.” Diagnostics 13 (19). https://doi.org/10.3390/diagnostics13193135.
D’Angelo, Gina, and Di Ran. 2025. “Tutorial on Firth’s Logistic Regression Models for Biomarkers in Preclinical Space.” Pharmaceutical Statistics 24 (1): 1–8. https://doi.org/10.1002/pst.2422.
Klauenberg, Katy, Gerd Wübbeler, Bodo Mickan, Peter Harris, and Clemens Elster. 2015. “A Tutorial on Bayesian Normal Linear Regression.” Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.
Leeuw, Christiaan de, and Irene Klugkist. 2012. “Augmenting Data with Published Results in Bayesian Linear Regression.” Multivariate Behavioral Research 47 (3): 369–91. https://doi.org/10.1080/00273171.2012.673957.
Liu, Yi Ming, Sam Li Sheng Chen, Amy Ming Fang Yen, and Hsiu Hsi Chen. 2013. “Individual Risk Prediction Model for Incident Cardiovascular Disease: A Bayesian Clinical Reasoning Approach.” International Journal of Cardiology 167 (5): 2008–12. https://doi.org/10.1016/j.ijcard.2012.05.016.
Schoot, Rens van de, Sarah Depaoli, Ruth King, Bianca Kramer, Kaspar Märtens, Mahlet G. Tadesse, Marina Vannucci, et al. 2021. “Bayesian Statistics and Modelling.” Nature Reviews Methods Primers 1 (1): 1. https://doi.org/10.1038/s43586-020-00001-2.
Schoot, Rens van de, David Kaplan, Jaap Denissen, Jens B. Asendorpf, and Marcel A. G. van Aken. 2013. “A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research.” https://doi.org/10.1111/cdev.12169.